Client Report - Home Built Date Prediction Using ML Model, Project 4

Course DS 250

Author

Adam Ulrich

Elevator pitch

In this project, we train an ML model to predict whether a home was built before 1980 based on the other columns in the dataset. This matters because pre-1980 homes may contain asbestos, and the source data can be dirty or incomplete. The goal is a trained model that predicts with at least 90% accuracy whether a home was built pre- or post-1980.

read data clean data
import math
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dwellings_denver.csv")

# clean up dirty data
df['condition'] = df['condition'].replace("AVG", "average")
df['floorlvl'] = df['floorlvl'].fillna(0)
df['gartype'] = df['gartype'].fillna("None")
df['pre-1980'] = df['yrbuilt'] < 1980

Question 1| Relationship Charts

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

evaluate arcstyle
# create dataframe for comparing architecture style to year built
style_yb = df[['arcstyle', 'pre-1980']]
style_yb = style_yb.sort_values('arcstyle')
style_yb['arcstyle'] = style_yb['arcstyle'].astype('string')

# show chart
chart1 = px.histogram(style_yb,
                    x='arcstyle',
                    color='pre-1980',
                    title='Pre/Post 1980 Homes per Architecture Style',
                    labels={'arcstyle': 'Architecture Style'}
                    )
chart1.show()


The chart above shows a reasonable correlation between architecture style and year built. Blue bars represent the count of homes in each style built before 1980; the red segment stacked above shows homes built in 1980 or later. Most pre-1980 homes were one-story. However, other styles (end unit, middle unit) are fairly evenly split, making this feature somewhat useful but not a great predictor on its own.

evaluate nbhd
nbhd_yb = df[['nbhd','pre-1980']]
nbhd_yb = nbhd_yb.sort_values('nbhd')
nbhd_yb['nbhd'] = nbhd_yb['nbhd'].astype('string')

# show chart
chart2 = px.histogram(nbhd_yb,
                    x='nbhd', 
                    color='pre-1980', 
                    nbins=800, 
                    range_y=([0,700]), 
                    title='Pre/Post 1980 Homes per Neighborhood',
                    labels={'nbhd':'Neighborhood Code' }
                    )
chart2.show()


The chart above shows a strong correlation between neighborhood and year built. Blue bars represent the count of homes in each neighborhood built before 1980; the red segment stacked above shows homes built in 1980 or later. Unsurprisingly, most bars are almost entirely red or entirely blue, since the homes in a neighborhood tend to be built during the same period. This appears to be a very good predictor.

Task 2| Model Building

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

%% Flowchart Load to Score
flowchart LR
  A[Load Data] --> B(Clean Data)
  B --> C(Encode Categorical Data)
  C --> D(Classify/Select Columns)
  D --> E(Split Data for train/test)
  E --> F(Training Data)
  F --> G(Train Model)
  E --> H(Testing Data)
  G --> I(Test Model)
  H --> I(Test Model)
  I --> J(Score Model)

read and format data
#define columns we will use for training and testing, x and y
columns =['nbhd', 'quality', 'stories', 'gartype', 'numbaths', 'arcstyle']
columns_to_encode =['quality', 'gartype', 'arcstyle']

x = df[columns]
y = df['pre-1980'] 

# encode columns (copy first so the original frame is untouched)
x_encoded = x.copy()
for c in columns_to_encode:
    x_encoded = encode_column(x_encoded, c)

#create the model
model = DecisionTreeClassifier()

model.fit(x_encoded,y)

#identify important features
selected_model = SelectFromModel(model, prefit=True)
x_encoded_selected = selected_model.transform(x_encoded)

# create model for the selected set
model_selected_by_model = DecisionTreeClassifier()
model_selected_by_model.fit(x_encoded_selected,y)

# create empty lists for returned accuracy, precision and feature pct

# 6 columns
results_accuracy_6columns = []
results_precision_6columns = []
results_feature_pct_6columns = []

# selected by model
results_accuracy_selected_by_model = []
results_precision_selected_by_model = []
results_feature_pct_selected_by_model = []

# run the test n times, store the data against the selected model and encoded model.
result_count = 25
row_count = int(math.sqrt(result_count))

model_list = [[model, [
                results_accuracy_6columns, 
                results_precision_6columns, 
                results_feature_pct_6columns], 
                x_encoded
            ],
            [model_selected_by_model, [
                results_accuracy_selected_by_model,
                results_precision_selected_by_model,
                results_feature_pct_selected_by_model],
                x_encoded_selected
            ]
            ]

while len(results_accuracy_6columns) < result_count:

    for m, datasets, column_list in model_list:

        #split the data
        x_train, x_test, y_train, y_test = train_test_split(column_list,y)

        # run the fit and score
        accuracy_result, precision_result, feature_result = train_test_model(m,
                    x_train,
                    x_test,
                    y_train,
                    y_test)
    
        # append results
        datasets[0].append(accuracy_result)
        datasets[1].append(precision_result)
        datasets[2].append(feature_result)



# test accuracy
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y)

Based on analysis of individual column score results (I evaluated each column's scoring accuracy separately and retained every column that scored greater than 10%), the initial columns selected were:

['nbhd', 'quality', 'stories', 'gartype', 'numbaths', 'arcstyle']

Data cleaning was applied to floorlvl and gartype to deal with NaNs. Because the dataset contains categorical data, I initially ran the dataframe through an encoder to translate the categories to numeric values, which increased the column count from 6 to 313.

However, I was unhappy with OneHotEncoder creating new columns and renaming existing ones, so I built my own encoder function that identifies the unique values in a column and translates them to numeric codes, leaving the column count unchanged.
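A minimal sketch of what such an `encode_column` helper could look like (an illustration of the approach, not necessarily the exact implementation used above):

```python
import pandas as pd

def encode_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace each unique value in `column` with an integer code,
    keeping the column name and column count unchanged."""
    df = df.copy()
    codes = {value: i for i, value in enumerate(df[column].unique())}
    df[column] = df[column].map(codes)
    return df
```

Unlike one-hot encoding, this keeps the column count at 6, but it imposes an arbitrary ordering on the categories; tree-based models are largely insensitive to that ordering, while linear models are not.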

The data set was then run through scikit-learn's SelectFromModel feature-selection algorithm, which reduced the columns from 6 down to just 2.

After trying the linear, random forest, and Gaussian Naive Bayes regressors, I realized that a classifier was a better fit for this binary target. The linear classifier was about 70% accurate, GaussianNB about 80% but quite slow, and Random Forest was also slower. I settled on the Decision Tree Classifier.
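This kind of comparison can be reproduced by cross-validating each candidate on the same data. The snippet below uses synthetic stand-in data and near-default parameters (both are assumptions, not the report's actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# stand-in for x_encoded / y from the report
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "gaussian_nb": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# mean 5-fold cross-validated accuracy per model
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>13}: {score:.3f}")
```

On real data the ranking can differ from the synthetic example; the point is that one loop gives a like-for-like accuracy comparison across all candidates.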

The data was then split into training and test segments using the train_test_split method.

Task 3| Model Justification

Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.

Feature Importance Data

justify model
# create a dataset for features
features = pd.DataFrame(results_feature_pct_6columns)
features.loc['mean'] = features.mean()
features.columns = list(x_encoded.columns)

# display feature data
features.style 
  nbhd quality stories gartype numbaths arcstyle
0 0.525007 0.141155 0.011048 0.052282 0.038346 0.232163
1 0.360813 0.125225 0.015635 0.130463 0.036368 0.331496
2 0.351498 0.133638 0.014956 0.132875 0.035302 0.331732
3 0.357352 0.128381 0.015382 0.131335 0.038884 0.328665
4 0.358454 0.126709 0.015512 0.127319 0.039100 0.332906
5 0.526874 0.142876 0.013335 0.051844 0.041938 0.223133
6 0.356214 0.129278 0.017047 0.127861 0.039646 0.329954
7 0.525127 0.142025 0.016352 0.051131 0.037306 0.228060
8 0.523773 0.143094 0.012764 0.055772 0.039377 0.225221
9 0.526472 0.144978 0.016064 0.051316 0.036629 0.224542
10 0.355344 0.124951 0.015990 0.127832 0.041918 0.333966
11 0.357446 0.128426 0.014617 0.127419 0.040870 0.331222
12 0.358377 0.131559 0.012634 0.135303 0.033852 0.328274
13 0.353404 0.128809 0.014199 0.132494 0.033038 0.338056
14 0.355388 0.128922 0.017362 0.127458 0.037156 0.333713
15 0.363935 0.124918 0.013693 0.128932 0.040177 0.328346
16 0.358892 0.125455 0.017768 0.129662 0.037877 0.330346
17 0.522740 0.142487 0.013373 0.052628 0.039557 0.229215
18 0.536791 0.146672 0.012110 0.042805 0.034462 0.227159
19 0.353294 0.132847 0.016373 0.128507 0.038170 0.330810
20 0.358680 0.131867 0.016987 0.126063 0.035019 0.331384
21 0.525522 0.142920 0.012827 0.054661 0.037175 0.226895
22 0.362956 0.125198 0.016833 0.132257 0.036014 0.326743
23 0.358268 0.124196 0.017669 0.133187 0.036608 0.330071
24 0.355956 0.131804 0.014205 0.129473 0.036452 0.332111
mean 0.411543 0.133136 0.014989 0.104835 0.037650 0.297847

Feature Importance Summary

To reduce variance between training runs, I ran 25 unique train/test splits and generated feature-importance data for each. Neighborhood and architecture style rank well above the other features; quality and garage type are also reasonably important.

justify model
# create dataframe for showing pie chart
features_means = features.mean()

features_pie = pd.DataFrame(zip(list(x_encoded.columns),list(features_means)))
features_pie.columns = ["feature", 'percentage']

# show pie chart
feature_chart = px.pie(features_pie,values='percentage', names = 'feature')
feature_chart.show()

Task 4| Model Quality

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

Accuracy Scoring Data for Model Selection Columns

statistical summary for selected columns
# create a dataframe from the result for both the 6 column and selected columns
results_df = pd.DataFrame(results_accuracy_6columns)
results_df.columns = ['score']

results_df_selected = pd.DataFrame(results_accuracy_selected_by_model)
results_df_selected.columns = ['score']

# reshape the datapoints for a grid display 
df_grid = pd.DataFrame(results_df.to_numpy().reshape(row_count,row_count))
df_grid_selected = pd.DataFrame(results_df_selected.to_numpy().reshape(row_count,row_count))

# set color
cm = sns.light_palette("blue", as_cmap=True)

#show table
df_grid_selected.style \
    .hide(axis='columns') \
    .format(precision=3) \
    .background_gradient(cmap=cm) \
    .set_table_styles([{
        'selector': 'caption',
        'props': [
            ('color', 'blue'),
            ('font-size', '25px')
        ]
    }])
0 0.916 0.914 0.916 0.913 0.917
1 0.916 0.916 0.912 0.913 0.918
2 0.918 0.913 0.915 0.912 0.914
3 0.914 0.920 0.917 0.918 0.909
4 0.916 0.921 0.915 0.920 0.918

Accuracy Scoring Data for 6 Columns

statistical summary for 6 columns
# set color
cm = sns.light_palette("blue", as_cmap=True)

#show table
df_grid.style \
    .hide(axis='columns') \
    .format(precision=3) \
    .background_gradient(cmap=cm) \
    .set_table_styles([{
        'selector': 'caption',
        'props': [
            ('color', 'blue'),
            ('font-size', '25px')
        ]
    }])
0 0.942 0.949 0.952 0.948 0.947
1 0.940 0.951 0.945 0.948 0.950
2 0.947 0.947 0.946 0.948 0.948
3 0.945 0.945 0.953 0.946 0.948
4 0.950 0.957 0.950 0.953 0.950

Accuracy Summary Analysis

statistical summary 2
# describe the statistical data, and transpose for display
described_data = results_df.describe().transpose()[['count','mean','std','min','max']]
described_data = described_data.rename(columns={'std':'standard deviation'})

described_selected_data = results_df_selected.describe().transpose()[['count','mean','std','min','max']]
described_selected_data = described_selected_data.rename(columns={'std':'standard deviation'})

# create statistical data for use in narrative
mean = round(float(described_data['mean'].to_string().split()[1]),3)
standard_deviation = round(float(described_data['standard deviation'].to_string().split()[1]),3)
min_value = round(float(described_data['min'].to_string().split()[1]),3)
max_value = round(float(described_data['max'].to_string().split()[1]),3)
mean_selected = round(float(described_selected_data['mean'].to_string().split()[1]),3)



# show chart
described_data.style.format({"count" : "{:,.0f}",
                 "mean" : "{:.3f}",
                 "standard deviation" : "{:.3f}",
                 "min" : "{:.3f}",
                 "max" : "{:.3f}"
                 }) \
            .set_table_styles([{
                'selector': 'caption',
                'props': [
                    ('color', 'blue'),
                    ('font-size', '25px')
                ]
            }])
  count mean standard deviation min max
score 25 0.948 0.004 0.940 0.957


I ran the train_test_split, fit, and score methods against each column set 25 times to obtain a statistically meaningful sample of scores. The first table above shows results using just the two columns chosen by the SelectFromModel selector; the second shows results using the six columns I originally selected.

The mean accuracy on the 2-column data set was 0.916. Using the 6-column data set instead increases mean accuracy to 0.948.

In addition, the standard deviation across the 6-column samples was tiny at 0.004, with a minimum of 0.940 and a maximum of 0.957.

The resulting model determines pre-1980 homes with a mean accuracy rate of 0.948. Taking mean ± two standard deviations as an approximate 95% interval gives (0.940, 0.956).
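That interval is just mean ± two standard deviations of the per-run scores; the three scores below are illustrative placeholders (chosen to have mean 0.948 and std 0.004), not the actual 25 results:

```python
import statistics

def score_interval(scores, k=2):
    """Approximate 95% interval as mean ± k standard deviations
    of the run-to-run scores (k=2 by default)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return round(mean - k * std, 3), round(mean + k * std, 3)

# placeholder scores with mean 0.948 and std 0.004
print(score_interval([0.944, 0.948, 0.952]))  # (0.94, 0.956)
```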

Accuracy is calculated as follows:

\[\begin{aligned} \text{Accuracy} &= \frac{R_c}{T_t}\\ \text{where}\quad R_c &= \text{number of correct predictions}\\ T_t &= \text{total test cases} \end{aligned}\]
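As a concrete check, scikit-learn's `accuracy_score` implements exactly this ratio (the labels below are made-up examples, not report data):

```python
from sklearn.metrics import accuracy_score

y_true = [True, True, False, False, True]
y_pred = [True, False, False, False, True]

# 4 of the 5 predictions match the truth -> 4/5
print(accuracy_score(y_true, y_pred))  # 0.8
```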

Precision Scoring Data for 6 Columns

precision statistical summary for 6 columns
# create a dataframe from the result for both the 6 column and selected columns
results_df = pd.DataFrame(results_precision_6columns)
results_df.columns = ['score']

# reshape the datapoints for a grid display 
df_grid = pd.DataFrame(results_df.to_numpy().reshape(row_count,row_count))


#show table
df_grid.style \
    .hide(axis='columns') \
    .format(precision=3) \
    .background_gradient(cmap=cm) \
    .set_table_styles([{
        'selector': 'caption',
        'props': [
            ('color', 'blue'),
            ('font-size', '25px')
        ]
    }])
0 0.959 0.968 0.969 0.966 0.961
1 0.951 0.964 0.960 0.967 0.962
2 0.961 0.964 0.962 0.964 0.966
3 0.960 0.960 0.967 0.965 0.965
4 0.962 0.968 0.965 0.966 0.968

Precision Summary Analysis

precision statistical summary 2
# describe the statistical data, and transpose for display
described_data = results_df.describe().transpose()[['count','mean','std','min','max']]
described_data = described_data.rename(columns={'std':'standard deviation'})

described_selected_data = results_df_selected.describe().transpose()[['count','mean','std','min','max']]
described_selected_data = described_selected_data.rename(columns={'std':'standard deviation'})

# create statistical data for use in narrative
mean = round(float(described_data['mean'].to_string().split()[1]),3)
standard_deviation = round(float(described_data['standard deviation'].to_string().split()[1]),3)
min_value = round(float(described_data['min'].to_string().split()[1]),3)
max_value = round(float(described_data['max'].to_string().split()[1]),3)
mean_selected = round(float(described_selected_data['mean'].to_string().split()[1]),3)

# show chart
described_data.style.format({"count" : "{:,.0f}",
                 "mean" : "{:.3f}",
                 "standard deviation" : "{:.3f}",
                 "min" : "{:.3f}",
                 "max" : "{:.3f}"
                 }) \
            .set_table_styles([{
                'selector': 'caption',
                'props': [
                    ('color', 'blue'),
                    ('font-size', '25px')
                ]
            }])
  count mean standard deviation min max
score 25 0.964 0.004 0.951 0.969

The mean precision on the 6-column data set was 0.964. Precision is calculated as below. It is a useful indicator that the model is not producing a significant number of false positives. Our precision here is excellent, even better than our accuracy.

\[\begin{aligned} \text{Precision} &= \frac{P_t}{P_t + P_f}\\ \text{where}\quad P_t &= \text{true positives}\\ P_f &= \text{false positives} \end{aligned}\]
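The same ratio is available as scikit-learn's `precision_score` (made-up labels again; `True`, i.e. built pre-1980, is the positive class):

```python
from sklearn.metrics import precision_score

# True = built pre-1980 (the positive class)
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]

# 2 true positives, 1 false positive -> 2 / (2 + 1)
print(precision_score(y_true, y_pred))
```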